Algorithms for Molecular Biology
○ Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match Algorithms for Molecular Biology's content profile, based on 15 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.
Diseth, A. C.; Puglisi, S. J.
Show abstract
Given a sequence S of subsets of symbols drawn from an alphabet of size{sigma} , a subset rank query srank(i, c) asks for the number of subsets before the ith subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space--less than 3 bits per k-mer. Our experiments show that this translates to new Pareto optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.
Shur, A.; Tziony, I.; Orenstein, Y.
Show abstract
Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size{sigma} , a minimizer is defined by two positive integers k, w and a linear order{rho} on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k- 1 its minimal k-mer with respect to{rho} . A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite{sigma} -ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require to explicitly store k-mer ranks in{Omega} (2k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizers was proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w[≥] k- 2, a random 10-minimizer has, on expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive to the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers. Notably, we are the first to benchmark constant-space minimizers in the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizers-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizers-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
Frost, H. R.
Show abstract
We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. Elegans neural network.
Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.
Show abstract
Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or Lateral-Gene Transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm in the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments. Full online versionhttps://www.biorxiv.org/content/10.1101/2025.11.20.689557
Melsted, P.; Guthnyjarson, E. M.; Nordal, J.
Show abstract
We present a GPU implementation of kallisto for RNA-seq transcript quantification. By redesigning the core algorithms: pseudoalignment, equivalence class intersection, and the EM algorithm; for massively parallel execution on GPUs, we achieve a 30-50x speedup over multithreaded CPU kallisto. On a benchmark of 100 Geuvadis samples from Human cell lines the GPU version processes paired-end reads at a rate of 3.6 million per second, completing a typical sample in seconds rather than minutes. For a large dataset of 295 million reads, runtime drops from 40 minutes to 50 seconds. Our implementation demonstrates that careful algorithmic redesign, rather than naive porting of software, is necessary to fully exploit the computing power of GPUs in sequence analysis.
Hendrychova, V.; Brinda, K.
Show abstract
One important question in bacterial genomics is how to represent and search modern million-genome collections at scale. Phylogenetic compression effectively addresses this by guiding compression and search via evolutionary history, and many related methods similarly rely on tree- and ordering-based heuristics that leverage the same underlying phylogenetic signal. Yet, the mathematical principles underlying phylogenetic compression remain little understood. Here, we introduce the first formal framework to model phylogenetic compression mechanisms. We study genome collections represented as RLE-compressed SNP, k-mer, unitig, and uniq-row matrices and formulate compression as an optimization problem over genome orderings. We prove that while the problem is NP-hard for arbitrary data, for genomes following the Infinite Sites Model it becomes optimally solvable in polynomial time via Neighbor Joining (NJ). Finally, we experimentally validate the models predictions with real bacterial datasets using an exact Traveling Salesperson Problem (TSP). We demonstrate that, despite numerous simplifying assumptions, NJ orderings achieve near-optimal compression across dataset types, representations, and k-mer ranges. Altogether, these results explain the mathematical principles underlying the efficacy of phylogenetic compression and, more generally, the success of tree-based compression and indexing heuristics across bacterial genomics.
Hung, L.-H.; Yeung, K. Y.
Show abstract
To accommodate rapid methodological turnover, bioinformatics pipelines typically consist of discrete binaries linked via scripts. While flexible, this architecture relies on intermediate files, sacrificing performance, and treating complex codebases as static silos. For example, the STAR aligner [1]--the standard engine for transcriptomics--uses an external script for adapter trimming, necessitating the decompression and re-compression of large files. These limitations presented scalability problems for uniform processing of data in the NIH MorPhiC consortium. We present our solution, STAR Suite, a human-engineered and AI-implemented modernization that integrates functionality directly into the C++ source. In just four months, a single developer added over 92,000 lines to the original 28,000-line codebase to produce four unified modules: STAR-core, STAR-Flex, STAR-Perturb, and STAR-SLAM that can be installed as a pre-compiled binary without introducing any new dependencies. This work demonstrates a new paradigm for the rapid evolution of high-performance bioinformatics software.
Kierzek, E.; Shabangu, T. S.; Hiltke, O. M.; Miaro, M.; Arteaga, S.; Znosko, B. M.; Jolley, E. A.; Bevilacqua, P. C.; SantaLucia, J.; SantaLucia, H. A.; Lin, H.; Metkar, M.; Aviran, S.; Soszynska-Jozwiak, M.; Kierzek, R.; Mathews, D. H.
Show abstract
Nearest neighbor analysis is commonly used to estimate RNA folding stabilities. In this contribution, we report a set of RNA folding nearest neighbor parameters for estimating free energy change for RNA sequences including 1-methyl-pseudouridine. Development of mRNA vaccines has identified 1-methyl-pseudouridine as a key nucleobase modification for suppressing innate immune responses. However, the contributions of these modifications to RNA folding stability were unclear. Our new parameters provide helical terms for 1-methyl-pseudouridine-adenine and 1-methyl-pseudouridine-guanine base pairs. The parameters also estimate loop stabilities for loops with 1-methyl-pseudouridine or a combination of 1-methyl-pseudouridine and uridine. These parameters are derived using 208 optical melting experiments and tested against an additional 16 optical melting experiments. On average, we find that substitution of uridine with 1-methyl-pseudouridine stabilizes RNA folding, with the extent of stabilization depending on adjacent sequence. The estimation of tRNA folding ensembles for tRNA sequences with 1-methyl-pseudouridine was significantly improved using the new nearest neighbor parameters. The new nearest neighbor parameters are provided as part of the RNAstructure software package. With these parameters, the secondary structures of natural sequences with 1-methyl-pseudouridine and mRNA therapeutics fully substituted with 1-methyl-pseudouridine can be modeled.
Durbin, R.
Show abstract
Skiplists (Pugh, 1990) are probabilistic data structures over ordered lists supporting [O] (log N) insertion and search, which share many properties with balanced binary trees. Previously we introduced the graph Burrows-Wheeler transform (GBWT) to support efficient search over pangenome path sets, but current implementations are static and cumbersome to build and use. Here we introduce two doubly-linked skiplist variants over run-length-compressed BWTs that support [O] (log N) rank, access and insert operations. We use these to store and search over paths through a syncmer graph built from Edgars closed syncmers, equivalent to a sparse de Bruijn graph. Code is available in rskip.[ch] within the syng package at github.com/richarddurbin/syng. This builds a 5.8 GB lossless GBWT representation of 92 full human genomes (280Gbp including all centromeres and other repeats) single-threaded in 52 minutes, on top of a 4GB 63bp syncmer set built in 37 minutes. Arbitrarily long maximal exact matches (MEMs) can then be found as seeds for sequence matches to the graph at a search rate of approximately 1Gbp per 10 seconds per thread.
Ormond, C.; Ryan, N. M.; Corvin, A.; Heron, E. A.
Show abstract
SummaryBICEP is a Bayesian inference model that evaluates how likely a rare variant is to be causal for a genomic trait in pedigree-based analyses. The original prior model in BICEP was designed for single nucleotide variants only. Here, we have developed an extension of the prior models for more comprehensive genomic analysis to include indels and copy number variants. We benchmark the performance of these new priors and show comparable performance accuracy with the existing single nucleotide variant prior model. For copy number variants we evaluate four different input predictors to the models and recommend the best performing ones as the default. Availability and implementationthe updated prior models have been implemented in the current version of BICEP available from: https://github.com/cathaloruaidh/BICEP.
Plachy, J.; Sladky, O.; Brinda, K.; Vesely, P.
Show abstract
The growing interest in k-mer-based methods across bioinformatics calls for compact k-mer set representations that can be optimized for specific downstream applications. Recently, masked superstrings have provided such flexibility by moving beyond de Bruijn graph paths to general k-mer superstrings equipped with a binary mask, thereby subsuming Spectrum--Preserving String Sets and achieving compactness on arbitrary k-mer sets. However, existing methods optimize superstring length and mask properties in two separate steps, possibly missing solutions where a small increase in superstring length yields a substantial reduction in mask complexity. Here, we introduce the first method for Pareto optimization of k-mer superstrings and masks, and apply it to the problem of compressing pan-genome k-mer sets. We model the compressibility of masked superstrings using an objective that combines superstring length and the number of runs in the mask. We prove that the resulting optimization problem is NP-hard and develop a heuristic based on iterative deepening search in the Aho-Corasick automaton. Using microbial pan-genome datasets, we characterize the Pareto front in the superstring-length/mask-run space and show that the front contains points that Pareto-dominate simplitigs and matchtigs, while nearly encompassing the previously studied greedy masked superstrings. Finally, we demonstrate that Pareto-optimized masked superstrings improve pan-genome k-mer set compressibility by 12-19% when combined with neural-network compressors.
Martayan, I.; Lobet, L.; Marchet, C.; Paperman, C.
Show abstract
Modern sequencing pipelines routinely produce billions of reads, yet the dominant storage formats (FASTQ and FASTA) are text-based and sequential, making high-throughput parsing a persistent bottleneck in bioinformatics. Their regular, line-oriented structure makes them well-suited to SIMD vectorization, but existing libraries do not fully exploit it. We present vectorized algorithms for high-throughput FASTA/Q parsing, with on-the-fly handling of non-ACTG characters and built-in bitpacking of DNA sequences into multiple compact representations. The parsing logic is expressed as a finite state machine, compiled into efficient SIMD programs targeting both x86 and ARM CPUs. These algorithms are implemented in Helicase, a Rust library exposing a tunable interface that retrieves only caller-requested fields, minimizing unnecessary work. Exhaustive benchmarks across a wide range of CPUs show that Helicase meets or exceeds the throughput of all evaluated state-of-the-art libraries, making it the fastest general-purpose FASTA/Q parser to our knowledge. Availabilityhttps://github.com/imartayan/helicase.
Marcil, W. A.
Show abstract
This paper introduces a geometric complementary code (GCC) that bridges discrete digital pixel graphics with continuous analog wave mechanics in a geometric morphometrics framework. By oscillating four shaded cubic pixels in a Cartesian grid, an emergent pattern resembling a face-centered cubic (FCC) unit cell lattice appears. This pattern is modeled first in lattice space-- incorporating polarities of the sagittal, transverse, and coronal planes traditionally applied to Cartesian space--and subsequently in Cartesian space as a topographic medium. In the Cartesian model, topographic values divide into rise values, where the grid converges toward elevated features along the Y-axis, and run values, where it flares within terrain dips. This produces a surface grid that undulates like a propagating wave. Across both models, a polar continuum emerges, oscillating between crossed and uncrossed polarities at micro- and macro-grid scales when applied at the topographic tile level. Each tile oscillates to generate counter-oscillatory perspectives across the macro-grid, dynamically shifting between approaching-point and vanishing-point modes. The GCC functions as a hierarchical pixel-based morphometry. It begins with pixel-scale analysis in a single ZX plane, advances to atomic-scale resolution across four ZX planes, and extends to topographic tile-scale across 16 ZX planes. This progression reveals a geometric expansion of FCC patterns within a nested 2x2 matrix processor. Grounded in a consistent Pythagorean grid-count relationship, the framework maps discrete pixel states onto continuous wave-like surface behavior. By addressing limitations in current geometric morphometrics--where 3D scanning and semilandmark methods remain anchored in discrete landmarks or sparse points rather than detailed continuous topographic dimensions--the GCC offers a novel hierarchical bridge between digital and analog domains.
Large, A. L.; Holmes, I. H.
Show abstract
The TKF92 model of molecular evolution--a linear birth-death process for indels, with finite-state continuous-time Markov chain substitutions--is exchangeable in residue identity at every site: the generative process treats amino acids symmetrically, conditional on a single substitution rate matrix. To introduce local heterogeneity, evolutionary models are often equipped with site-class mixtures, preserving this symmetry in the sense of de Finetti: conditional on the latent class, residues are still exchangeable. In a four-step theoretical ladder, we show how long-range structure such as couplings between distant sites can also be introduced exchangeably by using a Dirichlet process to partition sites into co-evolving classes. Our first step is a thorough analysis of TKF92 to establish sufficient statistics, limiting behavior, and inferential tools. We then lift the pairwise TKF92 hidden Markov model, in the limit of small time, to a time-indexed gravestone-augmented pair stochastic context-free grammar, and from there to its phylogenetic generalisation. This framing allows trajectories to be sampled exactly by Inside-Outside recursion. The third step places a Dirichlet process over the alive sites and asks co-keyed sites to evolve under a sparse Potts interaction -- an exchangeably-partitioned hidden direct-coupling model whose marginal alignment likelihood is unchanged from plain TKF92. The fourth rung of the ladder develops inference machinery: a Gibbs-Metropolis sampler that alternates alignment resamples, key-partition resamples, and stochastic parameter updates. We close several gaps along the way -- exact closed-form sufficient statistics for the linear birth-death-immigration component, the resolvable LHopital limit at{lambda} =, and a closed-form M-step for a recursive generalisation of TKF92 -- and we report a 1,000-family Pfam fit with K=4 site classes whose Potts atoms carry [~]0.54 nats of covariation per class-pair on top of a substantial single-site substitution model. Supplementary material, including full source code for inference, may be found at https://tkfdp.net/.
Beeloo, R.; Groot Koerkamp, R.
Show abstract
MotivationSearching short DNA patterns such as barcodes, primers, or CRISPR spacers within sequencing reads or genomes is a fundamental task in bioinformatics. These problems are instances of multiple approximate string matching (MASM) [Baeza-Yates and Navarro, 1997], which requires locating all occurrences with up to k errors of multiple patterns of length m in a text of length n. Classical approaches based on seeding with exact matches become inefficient for short patterns (m [≤] 64 bp) as k increases, producing either many spurious hits or missing true matches. Our previous work, Sassy1, showed that careful hardware optimization drastically accelerates single-pattern searches in long texts by distributing chunks of the text across SIMD lanes. MethodsSassy2 distributes multiple patterns across SIMD lanes to maximize parallelism when searching batches of short patterns. When k is small, often only a short substring of the pattern of length O(k) is needed to reject a possible match. Thus, Sassy2 first examines short suffixes of the patterns (e.g., the last 16 bp of 32 bp patterns), allowing more (but smaller) parallel SIMD lanes. Only positions passing this suffix filter undergo full pattern verification. ResultsOn synthetic data, Sassy2 achieves 10-50x speedups over Sassy1 for short texts (n [≤] 200 bp) and 2-4x for large texts (n [≥] 1 Mbp). On real-world tasks with 16 threads, Sassy2 reaches over 100 Gbp/s text throughput per guide when searching 312 gRNAs across the human genome and 116 Gbp/s throughput when demultiplexing Nanopore reads with 96 barcodes. In both cases, Sassy2 outperforms Sassy1 by 2-5x and Edlib by 20-45x. AvailabilitySassy2 is implemented in Rust and available at github.com/RagnarGrootKoerkamp/sassy.
Wu, H.; Medvedev, P.
Show abstract
Estimating mutation rates between evolutionarily related sequences is a central problem in molecular evolution. Due to the rapid expansion of datasets, modern methods avoid costly alignment and instead focus on comparing sketches of sets of constituent k-mers. While these methods perform well on many sequences, they are not robust to highly repetitive sequences such as centromeres. In this paper, we present three new estimators that are robust to the presence of repeats. The estimators are applicable in different settings, based on whether they need count information from zero, one, or both of the sequences. We evaluate our estimators empirically using highly repetitive alpha satellite sequences. Our estimators each perform best in their class and our strongest estimator outperforms all other tested estimators. Our software is open-source and freely available on https://github.com/medvedevgroup/Accurate_repeat-aware_kmer_based_estimator.
Conchon-Kerjan, E.; Rouze, T.; Robidou, L.; Ingels, F.; Limasset, A.
Show abstract
Approximate membership query structures are used throughout sequence bioinformatics, from read screening and metagenomic classification to assembly, indexing, and error correction. Among them, Bloom filters remain the default choice. They are not the most efficient structures in either time or memory, but they provide an effective compromise between compactness, speed, simplicity, and dynamic insertions, which explains their widespread adoption in practice. Their main drawback is poor cache locality, since each query typically requires several random memory accesses. Blocked Bloom filters alleviate this issue by restricting accesses for any given element to a single memory block, but this usually comes with a loss in accuracy at fixed memory. In this work, we introduce the Super Bloom Filter, a Bloom filter variant designed for streaming k-mer queries on biological sequences. Super Bloom uses minimizers to group adjacent k-mers into super-k-mers and assigns all k-mers of a group to the same memory block, thereby amortizing random accesses over consecutive k-mer queries and improving cache efficiency. We further combine this layout with the findere scheme, which reduces false positives by requiring consistent evidence across overlapping subwords. We provide a theoretical analysis of the construction of Super Bloom filters, showing how minimizer density controls the expected reduction in memory transfers, and derive a practical parameterization strategy linking memory budget, block size, collision overhead, and the number of hash functions to robust false-positive control. Across a broad range of memory budgets and numbers of hash functions, Super Bloom consistently outperforms existing Bloom filter implementations, with several-fold time improvements. As a practical validation, we integrated it into a Rust reimplementation of BioBloom Tools, a sequence screening tool that builds filters from reference genomes and classifies reads through k-mer membership queries for applications such as host removal and contamination filtering. This replacement yields substantially faster indexing and querying than both the original C++ implementation and Rust variants based on Bloom filters and blocked Bloom filters. The findere scheme also reduces false positives by several orders of magnitude, with some configurations yielding no observed false positives among 109 random queried k-mers. Code is available at https://github.com/EtienneC-K/SuperBloom and https://github.com/Malfoy/SBB.
De Maio, N.
Show abstract
Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories. These methods do not assume prior hypotheses regarding the shape of the phylogenetic tree, and this lack of prior assumptions can be useful in particular in case of idiosyncratic sampling patterns. For example, the rate at which species are sequenced can differ widely between lineages, with lineages more of interest to humans being usually sequenced more often than others. However, in some settings sampling can be lineage-agnostic. In genomic epidemiology, for example, the sequencing rate can change through time or across locations, but is often agnostic to the specific pathogen strain being sequenced. In this scenario, one expects that the abundance of a pathogen strain at a specific time and location in the host population is reflected in the relative abundance of that strain among the genomes sequenced at that time and location. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, can greatly improve the accuracy of phylogenetic inference. This is similar to the famous medical principle "when you hear hoofbeats, think of horses, not zebras". In our application this means that, when for example observing a (possibly incomplete) genome sequence that has a similar likelihood of belonging to multiple different strains, I aim to prioritize phylogenetic placement onto a common strain (the "horse", a common disease) rather than a rare one (the "zebra", a rare disease). I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree. This approach is based on a new interpretation of multifurcating phylogenetic trees particularly relevant at low divergence: multifurcations represent a lack of signal for resolving the bifurcating topology rather than an instantaneous multifurcating event, and so a multifurcating tree is interpreted as the set of bifurcating trees consistent with the multifurcating one, rather than as a single multifurcating topology. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and using simulations I show that both methods dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented as part of the free and open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
Sohail, M. A.; Sudharshan, R. R.; Pradhan, S. S.; Rao, A.
Show abstract
We present a new Hamiltonian-learning framework based on time-resolved measurement data from a fixed local IC-POVM and its application to inferring gene regulatory networks. We introduce the quantum Hamiltonian-based gene-expression model (QHGM), in which gene interactions are encoded as a parameterized Hamiltonian that governs gene expression evolution over pseudotime. We derive finite-sample recovery guarantees and establish upper bounds on the number of time and measurement samples required for accurate parameter estimation with high probability, scaling polynomially with system size. To recover the QHGM parameters, we develop a scalable variational learning algorithm based on empirical risk minimization. Our method recovers network structure efficiently on synthetic benchmarks and reveals novel, biologically plausible regulatory connections in Glioblastoma single-cell RNA sequencing data, highlighting its potential in cancer research. This framework opens new directions for applying quantum-like modeling to biological systems beyond the limits of classical inference.
Vasylenko, L.; Livnat, A.
Show abstract
At the fundamental conceptual level, two alternatives have traditionally been considered for how mutations arise and how evolution happens: 1) random mutation and natural selection, and 2) Lamarckism. Recently, the theory of Interaction-based Evolution (IBE) has been proposed, according to which mutations are neither random nor Lamarckian, but are influenced by information accumulating internally in the genome over generations. Based on the estimation-of-distribution algorithms framework, we present a simulation model that demonstrates nonrandom, non-Lamarckian mutation concretely while capturing indirectly several aspects of IBE: selection, recombination, and nonrandom, non-Lamarckian mutation interact in a complementary fashion; evolution is driven by the interaction of parsimony and fit; and random bits do not directly encode improvement but enable generalization by the manner in which they connect with the rest of the evolutionary process. Connections are drawn to Darwins observations that changed conditions increase the rate of production of heritable variation; to the causes of bell-shaped distributions of traits and how these distributions respond to selection; and to computational learning theory, where analogizing evolution to learning in accord with IBE casts individuals as examples and places the learned hypothesis at the population level. The model highlights the importance of incorporating internal integration of information through heritable change in both evolutionary theory and evolutionary computation.